Ancestral-specific reference genomes and methods of constructing

ABSTRACT

Ancestry has a significant impact on the major and minor alleles found in each nucleotide position within the genome. Due to mechanisms of inheritance, ancestral-specific information contained within the genome is conserved within members of an ancestry. For this reason, individuals within a specific ancestry are more likely to share alleles in their genomes with other members of the same ancestry. Functionally, the combination of alleles at all positions within a group of individuals defines that group as having a common ancestry. Moreover, the aggregation of differences between alleles at all positions distinguishes one ancestry from another. The genomic similarities and differences between ancestries provide a mechanism to generate reference genomes that are specific for each ancestry. Reference genomes that are specific to an ancestry can be used to increase the accuracy of whole genome sequencing, DNA-based diagnostics and therapeutic marker discovery and in a variety of real-world DNA-based applications. Provided herein are methods for constructing an ancestral-specific reference genome database.

1. CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. Ser. No. 13/834,685, filed Mar. 15, 2013, and claims the benefit of U.S. Provisional Patent Application Ser. No. 61/694,155, filed on Aug. 28, 2012, each of which is incorporated herein by reference in its entirety.

2. FIELD OF THE INVENTION

Provided herein are ancestral-specific reference genome databases and methods of making and using such ancestral-specific reference genome databases.

3. BACKGROUND OF THE INVENTION

Much progress has been made in the development of high-throughput DNA sequencing technology in recent years (Pettersson E, Lundeberg J, Ahmadian A (February 2009). “Generations of sequencing technologies”. Genomics 93 (2): 105-11. doi:10.1016/j.ygeno.2008.10.003. PMID 18992322; Staden, R (1979 Jun. 11). “A strategy of DNA sequencing employing computer programs”. Nucleic Acids Research 6 (7): 2601-10. doi:10.1093/nar/6.7.2601. PMID 461197; Church G M (January 2006). “Genomes for all”. Sci. Am. 294 (1): 46-54. doi:10.1038/scientificamerican0106-46. PMID 16468433). However, a comprehensive analysis of the entire genome is not currently commercially available or technologically possible. To date, whole genome sequencing is used only for research purposes (completegenomics.com/services/standard-sequencing/; illumina.com/services.ilmn), and a medically useful whole-genome-sequencing scale service simply does not exist.

While there are some reports of whole-genome-medical sequencing services, such services utilize information from the whole genome for only a few disease-associated single nucleotide polymorphisms (SNPs) in a limited number of genes (illumina.com/services.ilmn). This is in part because, although ancestral-specific mutations useful for medical applications of whole-genome sequencing have been generated in a variety of diseases (ncbi.nlm.nih.gov/omim), and Genome Wide Association Studies (GWAS) (Klein R J, Zeiss C, Chew E Y, Tsai J Y, Sackler R S, Haynes C, Henning A K, SanGiovanni J P, Mane S M, Mayne S T, Bracken M B, Ferris F L, Ott J, Barnstable C, Hoh J (April 2005). “Complement Factor H Polymorphism in Age-Related Macular Degeneration”. Science 308 (5720): 385-9. doi:10.1126/science.1109557. PMC 1512523. PMID 15761122) have generated a partial list of ancestral SNPs for research purposes, a comprehensive list of whole genome-wide ancestral SNPs has not been generated to date. Without a comprehensive list of SNPs, the development of whole genome sequencing as a medical diagnostic tool may not be possible.

Progress in the area of whole genome sequencing as an approved diagnostic tool has been impeded largely because medical sequencing methods developed to date generate a large number of false positive and false negative base calls inherent to the technology (Zhao J, Grant S F (February 2011). “Advances in Whole Genome Sequencing Technology”. Curr Pharm Biotechnol 23(2) 293-305. PMID 21050163). There is an additional layer of misinformation generated in whole genome sequencing due to the current NIH-derived reference genome used as the standard template for sequencing (Scherer, Stewart (2008). A short guide to the human genome. CSHL Press. p. 135. ISBN 0-87969-791-1; Wheeler D A, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y J, Makhijani V, Roth G T, Gomes X, Tartaro K, Niazi F, Turcotte C L, Irzyk G P, Lupski J R, Chinault C, Song X Z, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny D M, Margulies M, Weinstock G M, Gibbs R A, Rothberg J M. (2008). “The complete genome of an individual by massively parallel DNA sequencing”. Nature 452 (7189): 872-6; Bibcode 2008Natur.452..872W. doi:10.1038/nature06884. PMID 18421352). In particular, all existing sequencing technologies utilize the same standard reference genome for the bioinformatic reconstruction/assembly of the whole genome from the small DNA fragments sequenced during the process of obtaining a medically usable completed whole genome. The current standard reference genome, which was generated some years ago by the National Institutes of Health (NIH) as a model for genomic structure and sequence assembly, is based on a single whole genome sequence generated from the composite DNA obtained originally from five different individuals (Editorial (October 2010). “E pluribus unum”. Nature Methods 331 (5): 331. doi:10.1038/nmeth0510-331). As such, it is neither statistically significant nor accurate when comparing individuals from different ancestral backgrounds and may not provide a statistically significant reference for interpreting genomic information.

Although some sequencing companies claim to have a very high accuracy rate for determining a whole genome sequence (Quail, Michael; Smith, Miriam E; Coupland, Paul; Otto, Thomas D; Harris, Simon R; Connor, Thomas R; Bertoni, Anna; Swerdlow, Harold P; Gu, Yong (1 Jan. 2012). “A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers”. BMC Genomics 13 (1): 341. doi:10.1186/1471-2164-13-341; Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie (1 Jan. 2012). “Comparison of Next-Generation Sequencing Systems”. Journal of Biomedicine and Biotechnology 2012: 1-11. doi:10.1155/2012/251364), the reality is that, due to the large size of the genome (~3.2 billion base pairs coding for 20,000 to 25,000 distinct genes), even a small percentage of errors results in a large number of bases that are incorrectly called. A very low error rate is required for predictive medicine applications (Bentley D R (December 2006). “Whole-genome re-sequencing”. Curr. Opin. Genet. Dev. 16 (6): 545-552. doi:10.1016/j.gde.2006.10.009. PMID 17055251; Genetest.org). Recently, bioinformatic tools have been developed that correct genomic sequence based on familial sequence information for an individual family (familygenomics.systemsbiology.net/publications). Including familial information from three closely related individuals can improve DNA sequence accuracy by 70%. Using information from four or more family members increases accuracy by 90% (Roach J C, Glusman G, Smit A F, Huff C D, Drmanac R, Jorde L B, Hood L, Galas D J (10 Apr. 2010) “Analysis of Genetic Inheritance in a Family Quartet by Whole Genome Sequencing”. Science 328: 636-9 doi:10.3410/f.2707961.2371060). However, such correction tools are time-consuming and add inefficiency and cost to the process of whole genome sequencing.
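The scale of the problem can be illustrated with simple arithmetic. The following is a minimal sketch, assuming a genome of about 3.2 billion base pairs as stated above; the per-base accuracy values and the function name `expected_miscalls` are hypothetical and chosen only for illustration.

```python
# Approximate genome size stated in the text (base pairs).
GENOME_SIZE_BP = 3_200_000_000


def expected_miscalls(accuracy: float, genome_size: int = GENOME_SIZE_BP) -> int:
    """Return the expected number of incorrectly called bases
    for a given per-base accuracy (a fraction between 0 and 1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round((1.0 - accuracy) * genome_size)


# Hypothetical accuracy values, for illustration only.
for accuracy in (0.99, 0.999, 0.9999, 0.99999):
    miscalls = expected_miscalls(accuracy)
    print(f"{accuracy:.5f} accuracy -> ~{miscalls:,} miscalled bases")
```

Under these assumptions, even a per-base accuracy of 99.9% would leave on the order of millions of incorrectly called bases in a single genome.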

Accordingly, there is a need for the development of an ancestral-specific reference genome database that incorporates familial genome sequencing information to improve the accuracy of such ancestral-specific reference genomes. An ancestral-specific reference database can, in turn, be used as a tool, for example, for the diagnosis of a patient at risk for a genetic disease or disorder or for the prognosis of such a genetic disease or disorder.

4. SUMMARY OF THE INVENTION

Provided herein are ancestral-specific reference genome databases that can be used, for example, in high-throughput sequencing applications. Such applications include, but are not limited to, increased accuracy for medical sequencing, improved targeting and safety of drug therapy, and enhanced diagnostic capabilities conducted in a way that provides greater efficacy due to the ancestral-specific nature of the information. In certain embodiments, ancestral-specific reference genome databases can be used to identify ancestral-specific therapeutic, diagnostic and prognostic markers and to identify individuals who would respond to therapeutics based on their unique DNA sequence.

In one embodiment, the disclosure provides a method for constructing an ancestral-specific reference genome, the method comprising: (a) obtaining a familial genome data set comprising DNA sequences from members of a family; (b) comparing the DNA sequences within the familial genome data set to obtain a corrected familial genome data set; (c) preparing a first composite familial genome data set from the corrected familial genome data set; (d) repeating steps a-c for a second, third or more families to obtain a second, third or more composite familial genome data sets; (e) evaluating the first, second, third or more composite familial genome data sets for single nucleotide polymorphisms (SNPs) and/or haplotypes and assigning statistical significance to the SNPs and/or haplotypes; (f) grouping the first, second, third or more composite familial genome data sets based on single nucleotide polymorphisms (SNPs) and/or haplotypes that are statistically significant; and (g) preparing the ancestral-specific reference genome by compiling the SNPs and/or haplotypes shared by a group of composite familial genome data sets with the same ancestry.

In another embodiment, the disclosure provides a method for constructing an ancestral-specific reference genome, the method comprising: (a) obtaining a familial genome data set comprising genome DNA sequences from members of a family; (b) comparing the genome DNA sequences within the familial genome data set to obtain a corrected familial genome data set; (c) preparing a first composite familial genome data set from the corrected familial genome data set; (d) repeating steps a-c for a second, third or more families to obtain a second, third or more composite familial genome data sets; (e) evaluating the first, second, third or more composite familial genome data sets for single nucleotide polymorphisms (SNPs) and/or haplotypes and assigning statistical significance to the SNPs and/or haplotypes; (f) grouping the first, second, third or more composite familial genome sequences based on single nucleotide polymorphisms (SNPs) and/or haplotypes that are statistically significant; and (g) preparing the ancestral-specific reference genome by compiling the SNPs and/or haplotypes shared by a group of composite familial genome data sets with the same ancestry.

Other aspects and features are provided in more detail below.

5. BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 represents an exemplary sequence of steps to generate ancestral-specific reference genomes from large amounts of whole genome DNA sequence information.

FIG. 2 shows a diagrammatic representation of how statistically significant, ancestral-specific SNPs can be used to construct ancestral-specific reference genomes.

FIG. 3 represents an example of how ancestry-specific reference genome information can be used to decrease the number of erroneous base calls generated by whole genome DNA sequencing and the impact on the need to validate such erroneous base calls by orthogonal sequencing technology.

FIG. 4 shows a geographic distribution of countries of birth in Inova's whole genome sequence database. The ten large circles represent the ten ancestral genomes identified to date.

FIG. 5 shows the number of sequence variants by ancestry. Each column represents a different ethnicity. The number of variants is presented on the Y-axis. Individuals of African ancestry have the greatest number of variants when genomes are assembled to the standard NIH Reference Genome. By comparison, the North American and European Union genomes have the fewest variants, because they are derived from the same population used to generate the NIH Reference Genome.

FIG. 6 identifies individual genes that show ancestral-specific differences. The FLT3 gene only has variants in the African population. The PBX1 gene has variants in the American and Asian populations, but not the African and European populations. The TFRC gene only contains variants in the European population. All other genes have a similar number of variants in all ancestries. Population-based differences in variant number demonstrate the variability between populations at the genetic level and thus the importance of considering ancestry when sequencing a member of an ancestral population.

6. DETAILED DESCRIPTION OF THE INVENTION

Provided herein are ancestral-specific reference genome databases and methods for their construction. In certain embodiments, the ancestral-specific reference genome database comprises a plurality of ancestral-specific reference genomes that are statistically significant, familial corrected, and phased referenced. It is believed that, on a whole genome scale, there are thousands of differences between ancestral groups that significantly affect how individuals within these different ancestral groups react to drug therapies and that, when disease occurs, can impact their prognosis, diagnosis and therapy (landesbioscience.com/curie/chapter/3119/). Based on the observation that the DNA of individuals from different ancestral groups contains ancestral-specific differences that are important to the interpretation of genomic sequencing information (Kidd, J M; et al. (2008). “Mapping and sequencing of structural variation from eight human genomes”. Nature 453 (7191): 56-64. Bibcode 2008Natur.453...56K. doi:10.1038/nature06862. PMC 2424287. PMID 18451855), a number of ancestral-specific and statistically significant reference genomes were generated using a sufficient number of sequenced genomes (currently greater than 1,500 whole genomes and growing to >20,000). Such ancestral-specific genomes can be used, for example, to more accurately interpret genomic sequencing information for medically relevant diagnostic and prognostic purposes.

TERMINOLOGY

The following illustrative explanations are provided to facilitate understanding of certain terms used frequently herein, particularly in the examples. The explanations are provided as a convenience and are not limitative of the invention.

Diagnostic Marker—A gene or DNA sequence with a known location and sequence on a chromosome that can be used to identify individuals within a species, specifically used in the diagnosis of genetic disease.

Whole Genome Sequencing—A laboratory process that determines the complete DNA sequence of an organism's genome at a single time.

Medical-Grade DNA Sequencing—A laboratory process that determines the complete DNA sequence of an organism's genome at a single time utilizing techniques that conform to standard quality laboratory methods outlined by the Clinical Laboratory Improvement Amendments (CLIA), identifying all clinically relevant variants within a genome.

Haplotype—A group of alleles of different genes on a single chromosome that are linked closely enough to be inherited as a unit.

Ancestry—Persons initiating or comprising a line of descent.

Familial—Of or relating to a family, a group of people affiliated by consanguinity.

Database—A usually large collection of data organized especially for rapid search and retrieval (as by a computer).

Genome—One haploid set of chromosomes with the genes they contain; broadly: the genetic material of an organism.

Reference Genome—A digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes.

In silico—An expression used to mean “performed on computer or via computer simulation.”

Single Nucleotide Polymorphisms (SNPs)—A DNA sequence variation occurring when a single nucleotide in the genome differs between members of a biological species or paired chromosomes in a human.

Variant—A single nucleotide polymorphism or rare genetic substitution event depending on the frequency in the population.

Minor Allele—An alternative form of the same gene or same genetic locus found in the minority of the population.

Major Allele—An alternative form of the same gene or same genetic locus found in the majority of the population.

6.1 Methods for Generating Ancestral-Specific Reference Genomes

The steps involved in generating the ancestral-specific reference genomes of such an ancestral-specific reference genome database are outlined in FIG. 1. The method described herein utilizes whole genome sequence data to generate familial whole genome data sets from multiple individuals within a family. This sequence is corrected for content by comparing the genome sequences of related individuals within the family to obtain a corrected familial whole genome data set. These corrected sequences are then compiled into a composite familial whole genome sequence, in such a manner that ancestral-specific differences between the genomes are identifiable. Familial genomes can then be added to a composite familial whole genome sequence until the composite familial whole genome sequence reaches statistical significance. Upon reaching statistical significance, the composite familial whole genome sequence is evaluated for the presence of SNPs and haplotypes. Statistical probabilities are assigned to each SNP and haplotype, and the composite familial whole genome sequences are grouped according to statistically significant SNPs and/or haplotypes. The statistically significant SNPs and haplotypes can then be compiled into ancestral SNP and haplotype data sets. The ancestral-specific SNP and/or haplotype data sets can then be used to construct ancestral-specific reference genomes. A set of such ancestral-specific reference genomes describing ancestral and sub-ancestral groups can be utilized, for example, for medical diagnostics and research to target these groups, reducing the numbers of false positives and false negatives generated, improving the efficiency of whole genome sequencing, and enhancing performance of assays used in the development of personalized medicine applications.

The combination of familial-corrected sequences, ancestral-specific sequences, and statistical significance is critical to correcting the sequence to a sufficient level that the information can be used to evaluate a patient sample for mutations and disease-related SNPs. Without these corrections, the information obtained from DNA sequencing technologies generates so many false positives and false negatives that medical sequencing is currently outside of the realm of clinical utility, as demonstrated in FIG. 3.

The geographic placement of the country of birth for individual genomes in ITMI's whole genome sequence database, currently comprising more than 2,000 whole genome sequences, demonstrates that genomes are derived from 71 different countries. FIG. 4 shows how these countries of birth can be clustered into 10 ancestral genomes. The size of the circle is proportional to the number of genomes from that country. As more genomes are added to the database, the number of countries will increase; however, the greatest increase will be in the statistical significance achieved by each reference genome.

The number of variants found in each genome is a function of the difference between that genome and the NIH reference genome that is currently used to assemble and align genomes during the sequencing process. The larger the number of variants found in a genome, the greater the need for a reference genome that accounts for ancestry. FIG. 5 shows genomes clustered by ancestry in columns as a function of the number of variants on the Y-axis. The African genomes differ the most from the NIH reference genome, which is represented by the North American genomes. As genomes are assembled, variation from the NIH reference genome is represented by an increase in the number of variants in a whole genome sequence. The consensus sequence from a group of genomes within an ancestry defines the basis of the reference genome that can be used for de novo assembly of genomes that contain fewer variants and are thus more accurate representations of the individual genome.
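The relationship between reference choice and variant count can be sketched as follows. This is a simplified illustration that assumes already-aligned, equal-length haploid sequences and counts only single-base mismatches; the sequences and the function name `count_variants` are hypothetical toy values, not data from the database described herein.

```python
def count_variants(sample: str, reference: str) -> int:
    """Count positions at which an aligned sample differs from a reference."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must be aligned to the same length")
    return sum(1 for s, r in zip(sample, reference) if s != r)


# Hypothetical toy sequences, for illustration only.
standard_reference = "ACGTACGTAC"
ancestral_reference = "ACGAACGTTC"
sample = "ACGAACGTTA"

print(count_variants(sample, standard_reference))   # 3
print(count_variants(sample, ancestral_reference))  # 1
```

In this toy case the sample shows fewer variants against the reference that shares its ancestry, which is the effect the ancestral-specific reference genome is intended to produce.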

At the genetic level, ancestral variability is observed as differences in the number of variants in a gene. FIG. 6 shows the minor allele frequency for ten genes. Of the ten genes, there is ancestral variability within three. Using ancestral genomes would increase the ability to detect these differences at the genetic level and genomic level.

In one aspect, provided herein is a method for constructing an ancestral-specific reference genome database comprising the steps of: a) obtaining a familial whole genome data set, comprising whole genome DNA sequences from three or more individuals of a first family; b) comparing the whole genome DNA sequences within the familial whole genome data set to obtain a corrected familial whole genome data set; c) preparing a first composite familial whole genome sequence from the corrected familial whole genome data set, wherein the first composite familial whole genome sequence comprises one or more single nucleotide polymorphisms (SNPs) and/or haplotypes; d) repeating steps a-c for second, third or more families to obtain second, third or more composite familial whole genome sequences; e) evaluating the first, second, third or more composite familial whole genome sequences for single nucleotide polymorphisms (SNPs) and haplotypes and assigning statistical probabilities to each of the SNPs and haplotypes; f) grouping the first, second, third or more composite familial whole genome sequences based on single nucleotide polymorphisms (SNPs) and/or haplotypes that are statistically significant; and g) preparing a plurality of ancestral-specific reference genomes, each ancestral-specific reference genome based on the statistically significant SNPs and/or haplotypes shared by a group of composite familial whole genome sequences with the same ancestry.
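The grouping in step f) can be sketched in simplified form. The example below assumes that the statistically significant SNP positions have already been determined and that each composite familial sequence is an aligned haploid string; families are grouped when they carry identical alleles at those positions. The function name `group_by_signature`, the family identifiers, and the sequences are hypothetical, and an actual implementation could use any suitable clustering approach.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_by_signature(
    composites: Dict[str, str],
    significant_positions: Sequence[int],
) -> Dict[Tuple[str, ...], List[str]]:
    """Group families whose composite sequences share the same alleles
    at the given (assumed statistically significant) SNP positions."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for family_id, sequence in composites.items():
        signature = tuple(sequence[pos] for pos in significant_positions)
        groups[signature].append(family_id)
    return dict(groups)


# Hypothetical composite familial sequences, for illustration only.
composites = {"family_1": "ACGT", "family_2": "ACCT", "family_3": "TCGA"}
groups = group_by_signature(composites, significant_positions=[0, 3])
print(groups)
```

Here family_1 and family_2 fall into one group because they share alleles at both significant positions, while family_3 forms a separate group.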

In some embodiments, the method for constructing the ancestral-specific reference genome database comprises the step of obtaining a familial whole genome data set, comprising whole genome DNA sequences from three or more individuals of a first family.

As used herein, the term “family” refers to a group of individuals, related by blood, including individuals related to each other by the first degree (e.g., parents, full siblings, and children), second degree (grandparents, grandchildren, aunts, uncles, nephews, nieces and half-siblings), or third degree (first-cousins, great grandparents, and great grandchildren). In some embodiments, the familial whole genome data set comprises whole genome DNA sequences from four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty or more individuals of a family. In some embodiments, the familial whole genome data set comprises whole genome DNA sequences from individuals of a family related to each other by ten degrees or less, by nine degrees or less, by eight degrees or less, by seven degrees, by six degrees or less, by five degrees, by four degrees or less, by three degrees or less, by two degrees or less, or by one degree.

Obtaining a familial whole genome data set, comprising whole genome DNA sequences from multiple individuals, can be performed by any method known to those skilled in the art. In certain embodiments, the whole genome DNA sequences are obtained by performing a DNA sequencing reaction on whole genome DNA from three or more individuals from the same family. A DNA sequencing reaction can be performed using a commercially available sequencer such as those developed by Sanger (Sanger F, Coulson A R (May 1975). “A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase”. J. Mol. Biol. 94 (3): 441-8. doi:10.1016/0022-2836(75)90213-2. PMID 1100841), Life Technologies (invitrogen.com/site/us/en/home/Products-and-Services/Applications/Sequencing/Semiconductor-Sequencing/proton.html), Pacific Biosciences (pacificbiosciences.com/), Illumina (illumina.com/) and Complete Genomics (completegenomics.com/), for example. In other embodiments, the whole genome DNA sequences are obtained from publicly available databases, including, but not limited to, databases developed by the International HapMap Project (hapmap.ncbi.nlm.nih.gov/); the National Center for Biotechnology Information, National Institutes of Health, Bethesda, Md. (NCBI) (ncbi.nlm.nih.gov/); and the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, Heidelberg, Germany (ebi.ac.uk/embl). In specific embodiments, the whole genome DNA sequences may, in part, be obtained from HapMap populations from the International HapMap Project.

In some embodiments, the method for constructing the ancestral-specific reference genome database comprises the step of comparing each whole genome DNA sequence within a familial whole genome data set, in whole or in part, against one another to obtain a corrected familial whole genome data set. In specific embodiments, the step of comparing whole genome DNA sequences within a familial whole genome data set comprises comparing every base position of a whole genome DNA sequence against other whole genome DNA sequences within the familial whole genome data set to determine differences in DNA sequences among the whole genome DNA sequences of the familial whole genome data set. In particular embodiments, a difference observed at a base position among the DNA sequences in a familial whole genome data set is validated using an orthogonal technology (e.g., two or more genome sequencing methods as described infra) to determine if the difference is due to an artifact of the platform used (e.g., an erroneous base call on the platform) or if the difference is a true nucleotide change. Differences in sequences due to errors are corrected to produce a corrected familial whole genome data set.
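The comparison and correction step can be sketched as follows. This is a minimal illustration that assumes aligned, equal-length haploid sequences for each family member; real genomes are diploid and a real pipeline would also consider inheritance patterns. The function names, member labels, and the `confirmed` mapping (standing in for results returned by an orthogonal sequencing technology) are hypothetical.

```python
from typing import Dict, List, Tuple


def discordant_positions(family: Dict[str, str]) -> List[int]:
    """Return base positions at which the family members' calls differ.
    These are the positions that would be sent for orthogonal validation."""
    lengths = {len(seq) for seq in family.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be non-empty in number and aligned to the same length")
    return [
        pos
        for pos, column in enumerate(zip(*family.values()))
        if len(set(column)) > 1
    ]


def apply_corrections(
    family: Dict[str, str],
    confirmed: Dict[Tuple[str, int], str],
) -> Dict[str, str]:
    """Overwrite calls with bases confirmed by orthogonal validation.
    `confirmed` maps (member, position) to the validated base."""
    corrected = {}
    for member, seq in family.items():
        bases = list(seq)
        for (confirmed_member, pos), base in confirmed.items():
            if confirmed_member == member:
                bases[pos] = base
        corrected[member] = "".join(bases)
    return corrected


# Hypothetical toy family, for illustration only.
family = {"mother": "ACGTAC", "father": "ACGTAC", "child": "ACCTAC"}
flagged = discordant_positions(family)
# Assume orthogonal validation showed the child's call at position 2 was an error.
corrected_family = apply_corrections(family, {("child", 2): "G"})
print(flagged, corrected_family["child"])
```

If validation had instead confirmed the difference as a true nucleotide change, no correction would be supplied and the call would remain in the data set as a candidate SNP.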

In some embodiments, the method for constructing the ancestral-specific reference genome database comprises the step of preparing a composite familial whole genome sequence, in whole or in part, from the corrected familial whole genome data set, wherein the composite familial whole genome sequence comprises one or more single nucleotide polymorphisms (SNPs) and/or haplotypes. Such composite familial whole genome sequences can be constructed, for example, using the information provided by the corrected familial whole genome data set, familial inheritance patterns and specifically developed analytic tools and algorithms.
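One simplified way to picture a composite familial sequence is a per-position consensus that also records the positions at which family members carry different alleles. The sketch below assumes aligned haploid sequences and a simple majority rule; it does not model inheritance patterns or the specifically developed analytic tools mentioned above, and the function name `composite_sequence` and the example data are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def composite_sequence(
    corrected: Dict[str, str],
) -> Tuple[str, Dict[int, Dict[str, int]]]:
    """Build a majority-rule consensus from corrected family sequences and
    record allele counts at positions where members differ (candidate SNPs)."""
    sequences = list(corrected.values())
    if not sequences or len({len(s) for s in sequences}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    consensus = []
    variable_sites: Dict[int, Dict[str, int]] = {}
    for pos, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        # Most frequent base at this position becomes the consensus call.
        consensus.append(counts.most_common(1)[0][0])
        if len(counts) > 1:
            variable_sites[pos] = dict(counts)
    return "".join(consensus), variable_sites


# Hypothetical corrected family data, for illustration only.
corrected = {"mother": "ACGTAC", "father": "ACGTAT", "child": "ACGTAC"}
consensus, sites = composite_sequence(corrected)
print(consensus, sites)
```

Keeping the allele counts alongside the consensus preserves the family-level variation so that SNPs remain identifiable in later steps.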

In particular embodiments of the method, the steps of a) obtaining a familial whole genome data set, comprising whole genome DNA sequences, in whole or in part, from three or more individuals of a family; b) comparing the whole genome DNA sequences within the familial whole genome data set to obtain a corrected familial whole genome data set; c) preparing a composite familial whole genome sequence from the corrected familial whole genome data set, are repeated for a second, third or more families to obtain a second, third or more composite familial whole genome sequences. In certain embodiments of the method, the steps are repeated for 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 16 or more, 17 or more, 18 or more, 19 or more, 20 or more, 30 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100 or more families to obtain composite familial whole genome sequences for each of the families.

In particular embodiments, the method described herein comprises the step of evaluating the composite familial whole genome sequences, in whole or in part, for single nucleotide polymorphisms and/or haplotypes and assigning statistical probabilities to each of the SNPs and/or haplotypes. Any method known to those skilled in the art can be used to evaluate the presence of single nucleotide polymorphisms and haplotypes, including analytical tools that are available in the public domain (see, e.g., HaploView, broadinstitute.org/scientific-community/science/programs/medical-and-population-genetics/haploview/haploview). Statistical significance is then determined for each SNP and haplotype that is evaluated. A SNP is an “ancestral-specific SNP” if a particular allele of the SNP occurs at a frequency of greater than 99%, 95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%, 55% or 50% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 99% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 95% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 90% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 85% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 80% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 75% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 70% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 65% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In certain embodiments, a SNP is an “ancestral-specific SNP” if it occurs at a frequency of greater than 60% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. A haplotype is “ancestral-specific” if a particular haplotype occurs at a frequency of greater than 99%, 95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%, 55% or 50% as compared to the frequency at which it occurs in another distinct composite familial whole genome sequence. In particular embodiments, these ancestral-specific SNPs/haplotypes are then used to generate ancestral-specific reference genomes of the ancestral-specific reference genome database.
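The frequency threshold can be illustrated with a short sketch. The wording above admits more than one reading; the example assumes, for illustration only, that an allele is treated as ancestral-specific when its frequency within one group exceeds the chosen threshold and is also higher than its frequency in the comparison group. The function names, the default threshold, and the allele observations are hypothetical.

```python
from typing import Sequence


def allele_frequency(observations: Sequence[str], allele: str) -> float:
    """Fraction of observed alleles at one SNP position equal to `allele`."""
    if not observations:
        raise ValueError("at least one observation is required")
    return sum(1 for obs in observations if obs == allele) / len(observations)


def is_ancestral_specific(
    group: Sequence[str],
    comparison: Sequence[str],
    allele: str,
    threshold: float = 0.95,
) -> bool:
    """Assumed reading: the allele's frequency in `group` exceeds `threshold`
    and is higher than its frequency in `comparison`."""
    f_group = allele_frequency(group, allele)
    f_comparison = allele_frequency(comparison, allele)
    return f_group > threshold and f_group > f_comparison


# Hypothetical allele observations at a single SNP position.
group_a = ["A"] * 98 + ["G"] * 2
group_b = ["A"] * 10 + ["G"] * 90

print(is_ancestral_specific(group_a, group_b, "A", threshold=0.95))  # True
print(is_ancestral_specific(group_a, group_b, "A", threshold=0.99))  # False
```

The same test could be applied to haplotypes by treating each haplotype as a single observed value in place of an allele.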

In certain embodiments, an ancestral-specific reference genome of the ancestral-specific reference genome database comprises up to and including 1×10⁶, 1.5×10⁶, 2.0×10⁶, 2.5×10⁶, 3×10⁶, 3.5×10⁶, 4×10⁶, 4.5×10⁶, 5×10⁶, 5.5×10⁶, 6×10⁶, 6.5×10⁶, 7×10⁶, 7.5×10⁶, 8×10⁶, 8.5×10⁶, 9×10⁶, 9.5×10⁶, 1×10⁷, 1.5×10⁷, 2×10⁷, 3×10⁷, 4×10⁷ or 5×10⁷ or more SNPs. The ancestral-specific SNPs identified using this method can be used to generate a composite ancestral-specific reference genome for each ancestral group analyzed.
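As a non-limiting sketch of how identified ancestral-specific SNPs might be compiled into one composite reference per group, the following example substitutes each group's alleles into a shared starting sequence. The use of a single backbone sequence, the 0-based positions and all names are assumptions made for illustration; the disclosure does not prescribe this data layout.

```python
from typing import Dict, List, Tuple

# (position, allele, group) triples; positions are 0-based (an assumption).
SnpCall = Tuple[int, str, str]


def build_group_references(backbone: str, snps: List[SnpCall]) -> Dict[str, str]:
    """Build one composite sequence per ancestral group.

    Each group starts from a copy of the hypothetical backbone sequence and
    receives only the alleles that were identified as specific to it.
    """
    per_group: Dict[str, List[str]] = {}
    for pos, allele, group in snps:
        if not 0 <= pos < len(backbone):
            raise ValueError(f"position {pos} lies outside the backbone")
        # Create the group's working copy the first time the group is seen.
        seq = per_group.setdefault(group, list(backbone))
        seq[pos] = allele
    return {group: "".join(seq) for group, seq in per_group.items()}


# Hypothetical toy data.
calls = [(1, "T", "European"), (6, "A", "African"), (3, "C", "European")]
for group, sequence in build_group_references("ACGTACGT", calls).items():
    print(group, sequence)
```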

The method described above can then be repeated and refined using subsets of individuals from each ancestral group. For example, the European ancestral-specific reference genome may be subdivided into an Eastern European-specific reference genome, a Northern European-specific reference genome, etc.

The triangle shown in FIG. 2 depicts how this information is used to generate ancestral-specific reference genomes. Each of the corners of the triangle shown in FIG. 2 represents an ancestral group, i.e., European, African, or Asian. Markers that plot at the corners of the triangle represent ancestral-specific SNPs. For example, points that plot in the corner at the bottom right-hand sector of the triangle represent SNPs that are specific to individuals of European ancestry, because these variants occur in individuals of European ancestry, but not in individuals of African or Asian ancestries.
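The placement of a marker in such a triangle can be illustrated with a short sketch. It assumes that a marker's position is given by its three group frequencies normalized to sum to one (barycentric weights) and that a marker counts as lying "at a corner" when one weight reaches a cutoff. Both the normalization and the 0.9 cutoff are hypothetical choices, not values taken from FIG. 2.

```python
from typing import Dict, Optional


def ternary_weights(freqs: Dict[str, float]) -> Dict[str, float]:
    """Normalize group frequencies so that they sum to 1."""
    total = sum(freqs.values())
    if total <= 0:
        raise ValueError("the allele is absent from all groups")
    return {group: f / total for group, f in freqs.items()}


def nearest_corner(freqs: Dict[str, float], cutoff: float = 0.9) -> Optional[str]:
    """Return the group whose corner the marker plots at, or None."""
    weights = ternary_weights(freqs)
    group, weight = max(weights.items(), key=lambda kv: kv[1])
    return group if weight >= cutoff else None


# A variant seen only in the European group plots at the European corner.
print(nearest_corner({"European": 0.6, "African": 0.0, "Asian": 0.0}))
# A variant equally common in all groups plots in the middle of the triangle.
print(nearest_corner({"European": 0.3, "African": 0.3, "Asian": 0.3}))
```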

6.2 Uses of Ancestral-Specific Reference Genomes

The ancestral-specific reference genomes described herein, in whole or in part, have applications in the fields of analysis, DNA-based diagnostics, DNA sequencing, pharmaceutical drug development and clinical application of genomic information. These reference genomes make it possible to analyze whole genome or exome sequence data to generate more meaningful results by eliminating false positives and false negatives from the sequence data. The improved accuracy provided by ancestral-specific reference genomes permits the elimination of erroneous data. See FIG. 3.

The more accurate set of SNP and/or haplotype data generated from the results of this analysis may be placed in the context of other data, such as proteomic or pathway data, resulting in a more accurate interpretation of the impact of SNPs and/or haplotypes in the context of disease or for other applications as described in the examples listed below.

7. EXAMPLES

7.1 Enhanced Diagnostics

The field of DNA-based diagnostics relies on the ability to accurately identify DNA sequence, specifically in the nucleotide residues that result in disease-causing sequence variation. Accuracy of variant identification by sequence analysis is a major rate-limiting step in the development of novel diagnostic markers and their use in testing the population. Variants identified utilizing an enhanced reference genome translate into more accurate diagnostic markers and more accurate diagnostic tests. The utility of the reference genome for improving variant identification is independent of the technology used to generate variant information. By applying the information contained in the reference genome to the sequence technology utilized to generate the variant information, the interpretation of the variant information is enhanced. These markers can be used for prognostic or diagnostic testing for counseling of patients or as companion diagnostics for pharmaceutical compounds.

There are almost 1000 gene/SNP specific diagnostic tests available for medical diagnostics. This number is relatively small compared to the large number of potential disease-causing variants in the genome. These disease-causing variants occur in genetic disorders including, but not limited to: Achromatopsia, Aicardi Syndrome, Albinism, Alexander Disease, Alpers' Disease, Alzheimer's Disease, Angelman Syndrome, Autism, Bardet-Biedl Syndrome, Barth Syndrome, Best's Disease, Bipolar Disorder, Bloom Syndrome, Canavan Syndrome, Cancer, including Breast Cancer, Prostate Cancer, Ovarian Cancer, and other forms of cancer, including cancers resultant from germ-line and somatic mutations, Carnitine Deficiencies, Cerebral Palsy, Coffin Lowry Syndrome, Heart Defects, Hip Dysplasia, Cooley's Anemia, Corneal Dystrophy, Cystic Fibrosis, Cystinosis, Diabetes, Down Syndrome, Epidermolysis Bullosa, Familial Dysautonomia, Fibrodysplasia, Fragile X Syndrome, Deficiency Anemia, Galactosemia, Gaucher Disease, Gilbert's Syndrome, Glaucoma, Hemochromatosis, Hemoglobin C Disease, Hemophilia/Bleeding Disorders, Hirschsprung's Disease, Homocystinuria, Huntington's Disease, Hurler Syndrome, Klinefelter Syndrome, Macular Degeneration, Marshall Syndrome, Menkes Disease, Metabolic Disorders, Microphthalmus, Mitochondrial Disease, Mucolipidoses, Muscular Dystrophy, Neonatal Onset Multisystem Inflammatory Disease, Neural Tube Defects, Noonan Syndrome, Optic Atrophy, Osteogenesis Imperfecta, Peutz-Jeghers Syndrome, Phenylketonuria (PKU), Pseudoxanthoma Elasticum, Progeria, Scheie Syndrome, Schizophrenia, Sickle Cell Anemia, Skeletal Dysplasias, Spherocytosis, Spina Bifida, Spinocerebellar Ataxia, Stargardt Disease (Macular Degeneration), Stickler Syndrome, Tay-Sachs Disease, Thalassemia, Treacher Collins Syndrome, Tuberous Sclerosis, Turner's Syndrome, Urea Cycle Disorder, Usher's Syndrome or Werner Syndrome.

7.2 Ancestral-Specific Pharmaceutical Development

The development of pharmaceutical compounds is currently limited by the ability to identify groups within the general population that respond either favorably or unfavorably to a pharmaceutical compound. For example, it is possible to develop a pharmaceutical compound that has therapeutic efficacy in a sub-population, but the therapeutic effect may be obscured because that sub-population represents a minority in the general population. Similarly, it is possible to develop a pharmaceutical compound that has therapeutic efficacy in one sub-population, but has significant deleterious side effects in another sub-population. For this reason, it is advantageous to develop and evaluate pharmaceutical compounds at the sub-population level. The ancestral-specific nature of these reference genomes is critical to the development of ancestral-specific pharmaceutical compounds. As pharmaceutical companies are encouraged by the Food and Drug Administration (FDA) and economic factors to produce more narrowly focused therapeutics and diagnostics, these reference genomes provide the ability to determine in advance if a therapeutic is effective in a subgroup of the population.

7.3 Medical-Grade DNA Sequencing

Current DNA sequencing using the existing reference genomes is for research purposes only. Companies that claim to perform medical-grade DNA sequencing are utilizing research quality materials and methods in a CLIA environment to evaluate a limited number of variants in a small subset of the genes contained within the genome. The false positive and false negative errors introduced into the DNA sequence are the combined result of technological issues and the use of an inaccurate reference genome. Use of the ancestral reference genomes described herein provides a more accurate DNA sequencing method for the development of medical sequencing on a commercially feasible scale.

Currently, all DNA sequencing companies utilize the existing NIH reference genome; however, tailoring the reference to the particular genealogic background of the individual improves the efficiency and accuracy of the final product. The current NIH reference genome is of limited utility because the sequence was generated from the DNA of only five individuals without regard to ancestry. Numerous versions of the NIH reference genome have been generated, correcting the reference sequence utilizing a variety of different datasets that also contain no ancestral information. The result is a reference genome that lacks statistical significance and haplotype information, and focuses only on major variants found in a single ancestry. Often, only minor variants are identified for nucleotide positions within the genome, or no call can be made because current base-calling software is unable to distinguish between two or more variants localized to the same nucleotide position. Ancestral-specific reference genomes that have been corrected with familial and haplotype information provide a mechanism for improving the quality of DNA sequencing to the point where it is medically useful.
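One simple form that a family-based correction could take is a check that each call in a child is compatible with the calls in both parents. The sketch below is offered only as an illustration under that assumption: this passage does not specify the correction procedure, the function and variable names are hypothetical, and the check ignores de novo mutations and sex chromosomes.

```python
from itertools import product
from typing import Tuple

Genotype = Tuple[str, str]  # the two alleles called at one position


def is_mendelian_consistent(
    child: Genotype, mother: Genotype, father: Genotype
) -> bool:
    """True if the child's genotype can be formed from one allele of each parent."""
    wanted = sorted(child)
    # Try every pairing of one maternal allele with one paternal allele.
    return any(sorted((m, f)) == wanted for m, f in product(mother, father))


# Consistent: the child inherits A from the mother and G from the father.
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
# Inconsistent: neither parent carries T, so the call would be flagged for review.
print(is_mendelian_consistent(("T", "T"), ("A", "A"), ("A", "G")))
```

Positions that fail such a check are candidates for re-examination before a composite familial data set is prepared.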

The use of the ancestral reference genomes enhances the ability of clinicians to apply genomic information to their patients. If the genealogy of a patient is known or can be determined by the DNA sequence of the individual or family members, the clinician can use that information to determine which therapy may best suit the needs and the safety of the patient based on the availability of ancestral-specific therapeutic compounds.

7.4 Identification of Personal Attributes for Non-Medical Purposes

In another aspect, provided herein is an example of using ancestral-specific reference genomes, in whole or in part, for non-medical applications which utilize genomic sequence and SNP data to inform an individual about personal attributes such as ancestry, gender, compatibility between individuals based on actual or perceived physical, biological or psychological attributes, genetic compatibility or other information that can be obtained about an individual from their sequence information. This example specifically enables individuals to learn more about potential partners by comparing genomic information that has been enhanced for accuracy with ancestral-specific reference information. Other applications also exist. For example, individuals may use the reference genomes to compare the variant profile of their genes for physical ability, intellectual capacity or musical talent with a reference genome to improve the accuracy of comparisons.

In one embodiment, the method for identifying an individual attribute in an individual such as ancestry, personal compatibility, a physical attribute, a biological attribute, a psychological attribute or genetic compatibility, comprises the step of comparing a DNA sequence of an individual with any one or more of the ancestral-specific reference genomes of the ancestral-specific reference genome databases, wherein the one or more ancestral-specific reference genomes comprises one or more single nucleotide polymorphisms and/or haplotypes associated with a known individual attribute, and determining whether the DNA sequence of the individual also comprises the one or more single nucleotide polymorphisms and/or haplotypes associated with the known individual attribute.
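The comparing and determining steps of this embodiment can be sketched as a lookup of attribute-associated alleles in an individual's calls. In the following illustration, each attribute is represented by a set of position/allele pairs that must all be present, as for a haplotype. The attribute labels, positions and data layout are hypothetical placeholders.

```python
from typing import Dict, Set


def attributes_present(
    sample: Dict[int, str], reference_markers: Dict[str, Dict[int, str]]
) -> Set[str]:
    """Return the attributes whose associated alleles all occur in the sample.

    sample maps a position to the allele called in the individual.
    reference_markers maps an attribute name to its {position: allele} markers.
    """
    found: Set[str] = set()
    for attribute, markers in reference_markers.items():
        # An empty marker set carries no evidence, so it never matches.
        if markers and all(sample.get(pos) == allele for pos, allele in markers.items()):
            found.add(attribute)
    return found


# Hypothetical calls and marker sets.
individual = {100: "A", 250: "T", 400: "G"}
markers = {
    "attribute_1": {100: "A", 250: "T"},  # two-SNP haplotype, both present
    "attribute_2": {400: "C"},            # single SNP, not present
}
print(attributes_present(individual, markers))
```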

7.5 Forensic Science Applications

In certain embodiments, the methods of using the ancestral-specific reference genome databases for forensic applications include, but are not limited to, paternity testing and improving identification of living or deceased individuals where conventional methods of identification fail, such as in a bomb blast, mass grave or natural disasters such as earthquakes and tidal waves. In the event that conventional methods of identification, such as fingerprint analysis, dental record review or DNA-specific information that can be used to identify a person, are unavailable or fail, comparison to reference genomes can provide information about a person's ancestry. For example, such an analysis could determine if a deceased individual is of Northern European versus Southern European descent, providing rescue groups or law enforcement or government agencies with information about a person's identity that they otherwise would not have.
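A comparison of an unidentified sample to several reference genomes might, for illustration, be reduced to scoring the sample against each group's ancestral-specific alleles. The sketch below assumes a simple score, the fraction of a group's markers covered by the sample that the sample matches. The scoring rule, the names and the toy data are hypothetical, and a practical analysis would need a statistical model rather than a raw fraction.

```python
from typing import Dict, List, Tuple


def rank_ancestries(
    sample: Dict[int, str], references: Dict[str, Dict[int, str]]
) -> List[Tuple[str, float]]:
    """Rank reference groups by the fraction of covered markers the sample matches."""
    scores: List[Tuple[str, float]] = []
    for group, markers in references.items():
        # Only positions actually called in the sample can be compared.
        covered = [pos for pos in markers if pos in sample]
        if not covered:
            continue  # no overlap, so no score for this group
        matches = sum(sample[pos] == markers[pos] for pos in covered)
        scores.append((group, matches / len(covered)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical sample and reference markers.
unknown = {10: "A", 20: "C", 30: "T"}
refs = {
    "Northern European": {10: "A", 20: "C", 30: "G"},
    "Southern European": {10: "G", 20: "C"},
    "Other": {40: "T"},  # no covered positions, so it is skipped
}
for group, score in rank_ancestries(unknown, refs):
    print(group, round(score, 2))
```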

7.6 Law Enforcement Applications

In other embodiments, the ancestral-specific reference genome databases and methods provided herein may be used in law-enforcement applications, such as the ancestral classification of an individual when a sample of their DNA is available that does not match an individual in law enforcement databases. Under such conditions, an unknown individual's DNA is used to determine the ancestry of the individual, making it possible to eliminate individuals outside of that ancestry as suspects and focusing the search for the guilty party on individuals from a specific ancestry. In another embodiment, ancestral reference genomes are used by government agencies such as the FBI or Department of Homeland Security to identify the ancestry of persons of interest such as terrorists, thus narrowing the search for persons of interest to a specific ancestry. In another embodiment, ancestral-specific reference genomes are applied to DNA-based information contained within FBI databases to improve the accuracy of identification of an individual. The improved accuracy resulting from the use of ancestral-specific reference genomes increases the statistical likelihood that the FBI has arrested the correct individual.

7.7 Reproduction Technologies

In another aspect, provided herein is a method of using one or more ancestral-specific reference genome(s), in whole or in part, of an ancestral-specific reference genome database described herein for the selection of embryos, eggs or sperm for artificial reproduction. This includes the genetic evaluation of embryos, eggs and sperm for the detection of genetic disease, genomic disease, pharmacogenomic applications, determination of relatedness of individuals or the selection of physical attributes such as eye color or hair color or the identification of other attributes of interest to couples, physicians or scientists.

This also relates to paternity testing and to the typing of embryos for in vitro fertilization to minimize ancestral-related diseases, such as in founder situations in inbred populations such as the Amish and Ashkenazi Jewish populations, and to minimize the risk of genetic disease from reproduction by related individuals. In some embodiments, the method comprises the step of comparing a DNA sequence of an embryo, egg and/or sperm with any one or more of the ancestral-specific reference genomes of the ancestral-specific reference genome database described herein, wherein the one or more of the ancestral-specific reference genomes comprises one or more single nucleotide polymorphisms and/or haplotypes associated with a known genetic disease, genomic attribute or physical characteristic, and determining whether the DNA sequence of the individual also comprises the one or more single nucleotide polymorphisms and/or haplotypes associated with the known genetic disease, genomic attribute or physical characteristic. In some embodiments, the method comprises the step of comparing a DNA sequence of a sperm or egg of a first individual and the DNA sequence of a sperm or egg of a second individual with one or more ancestral-specific reference genomes, in whole or in part, of an ancestral-specific reference genome database described herein to determine the relatedness of the first individual and the second individual. The use of ancestral-specific reference genomes makes the analysis more accurate than current sequence analysis that utilizes the existing reference genome and thus increases the likelihood of the preferred outcome.

7.8 Non-Human Uses

In another aspect, provided herein is a method of using ancestral reference genomes, in whole or in part, in other species for the selection of attributes. This includes, but is not limited to, the use of human and non-human reference genomes for identification of recombinant organisms that contain desired genotypes that may or may not confer a phenotype in the individual or lineage being evaluated. In one example, a “humanized mouse” animal model created in the laboratory to contain a part of or an entire human chromosome is evaluated for functional genes or DNA sequences contained in the hybrid. The advantage of utilizing ancestral-specific reference genomes is the improved accuracy of the DNA sequencing performed on these samples to ensure that the researcher is utilizing organisms that carry the variants necessary to achieve research goals.

In another embodiment, the reference genomes are used to improve the accuracy with which eggs, sperm or embryos are identified for the selective breeding of livestock, or the selection of microorganisms for research or industrial purposes, similar to their use in humans for reproductive technologies. In such instances, an organism-specific reference genome is created to facilitate the discrimination between different variants.

7.9 In Silico Genomics

In another aspect, provided herein is a system comprising: (1) a central processing unit and (2) a memory coupled to the central processing unit, the memory storing one or more ancestral-specific reference genome databases provided herein. In certain embodiments, the memory further stores a nucleic acid comparison computer program, wherein the nucleic acid comparison computer program is capable of comparing the nucleic acid sequence of a sample nucleic acid with the plurality of ancestral-specific reference genomes of the one or more ancestral-specific reference genome databases to determine the presence of one or more ancestral-specific reference genome SNPs or haplotypes in the nucleic acid sequence of the sample nucleic acid. In other embodiments, the system further comprises a user computer comprising an access software computer program that allows access to the one or more ancestral-specific reference genome databases from a server computer. In yet other embodiments, the user computer further comprises a nucleic acid comparison computer program, wherein the nucleic acid comparison computer program is capable of comparing the nucleic acid sequence of a sample nucleic acid with the plurality of ancestral-specific reference genomes of the one or more ancestral-specific reference genome databases to determine the presence of one or more ancestral-specific reference genome SNPs or haplotypes in the nucleic acid sequence of the sample nucleic acid.
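A minimal, non-limiting sketch of such a system is given below: an in-memory database holding several reference genomes, together with a comparison routine that reports which of each genome's SNP alleles occur in a sample sequence. The class name, its methods and the representation of a genome as a mapping from 0-based position to allele are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReferenceGenomeDatabase:
    """Hypothetical in-memory store of ancestral-specific SNP alleles."""

    genomes: Dict[str, Dict[int, str]] = field(default_factory=dict)

    def add_genome(self, name: str, snps: Dict[int, str]) -> None:
        """Store a copy of the SNP alleles for one reference genome."""
        self.genomes[name] = dict(snps)

    def compare(self, sample: str) -> Dict[str, List[int]]:
        """For each genome, list positions where the sample carries its allele."""
        hits: Dict[str, List[int]] = {}
        for name, snps in self.genomes.items():
            hits[name] = sorted(
                pos
                for pos, allele in snps.items()
                # Skip positions that fall outside the sample sequence.
                if 0 <= pos < len(sample) and sample[pos] == allele
            )
        return hits


# Hypothetical toy data.
db = ReferenceGenomeDatabase()
db.add_genome("European", {1: "T", 3: "C"})
db.add_genome("Asian", {1: "G"})
print(db.compare("ATGCA"))
```

In a client and server arrangement, the same `compare` routine could run on either side; only the location of the stored `genomes` mapping would differ.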

The embodiments described herein are intended to be merely exemplary, and those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, numerous equivalents to the specific procedures described herein. All such equivalents are considered to be within the scope of the present invention and are covered by the following claims.

LIST OF REFERENCES

-   Pettersson E, Lundeberg J, Ahmadian A (February 2009). “Generations of sequencing technologies”. Genomics 93 (2): 105-11. doi:10.1016/j.ygeno.2008.10.003. PMID 18992322
-   Staden, R (1979 Jun. 11). “A strategy of DNA sequencing employing computer programs”. Nucleic Acids Research 6 (7): 2601-10. doi:10.1093/nar/6.7.2601. PMID 461197
-   Church G M (January 2006). “Genomes for all”. Sci. Am. 294 (1): 46-54. doi:10.1038/scientificamerican0106-46. PMID 16468433
-   completegenomics.com/services/standard-sequencing
-   illumina.com/services.ilmn
-   ncbi.nlm.nih.gov/omim
-   Klein R J, Zeiss C, Chew E Y, Tsai J Y, Sackler R S, Haynes C, Henning A K, SanGiovanni J P, Mane S M, Mayne S T, Bracken M B, Ferris F L, Ott J, Barnstable C, Hoh J (April 2005). “Complement Factor H Polymorphism in Age-Related Macular Degeneration”. Science 308 (5720): 385-9. doi:10.1126/science.1109557. PMC 1512523. PMID 15761122
-   Zhao J, Grant S F (February 2011). “Advances in Whole Genome Sequencing Technology”. Curr Pharm Biotechnol 23(2) 293-305. PMID 21050163
-   Scherer, Stewart (2008). A short guide to the human genome. CSHL Press. p. 135. ISBN 0-87969-791-1
-   Wheeler D A, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y J, Makhijani V, Roth G T, Gomes X, Tartaro K, Niazi F, Turcotte C L, Irzyk G P, Lupski J R, Chinault C, Song X Z, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny D M, Margulies M, Weinstock G M, Gibbs R A, Rothberg J M (2008). “The complete genome of an individual by massively parallel DNA sequencing”. Nature 452 (7189): 872-6. Bibcode 2008Natur.452..872W. doi:10.1038/nature06884. PMID 18421352
-   Editorial (October 2010). “E pluribus unum”. Nature Methods 331 (5): 331. doi:10.1038/nmeth0510-331
-   Quail, Michael; Smith, Miriam E; Coupland, Paul; Otto, Thomas D; Harris, Simon R; Connor, Thomas R; Bertoni, Anna; Swerdlow, Harold P; Gu, Yong (1 Jan. 2012). “A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers”. BMC Genomics 13 (1): 341. doi:10.1186/1471-2164-13-341
-   Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie (1 Jan. 2012). “Comparison of Next-Generation Sequencing Systems”. Journal of Biomedicine and Biotechnology 2012: 1-11. doi:10.1155/2012/251364
-   Bentley D R (December 2006). “Whole-genome re-sequencing”. Curr. Opin. Genet. Dev. 16 (6): 545-552. doi:10.1016/j.gde.2006.10.009. PMID 17055251
-   Genetest.org
-   familygenomics.systemsbiology.net/publications
-   Roach J C, Glusman G, Smit A F, Huff C D, . . . Drmanac R, Jorde L B, Hood L, Galas D J (10 Apr. 2010). “Analysis of Genetic Inheritance in a Family Quartet by Whole Genome Sequencing”. Science 328: 636-9. doi:10.3410/f.2707961.2371060
-   landesbioscience.com/curie/chapter/3119/Kidd,
-   Kidd, J M; et al. (2008). “Mapping and sequencing of structural variation from eight human genomes”. Nature 453 (7191): 56-64. Bibcode 2008Natur.453...56K. doi:10.1038/nature06862. PMC 2424287. PMID 18451855
-   Sanger F, Coulson A R (May 1975). “A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase”. J. Mol. Biol. 94 (3): 441-8. doi:10.1016/0022-2836(75)90213-2. PMID 1100841
-   invitrogen.com/site/us/en/home/Products-and-Services/Applications/Sequencing/Semiconductor-Sequencing/proton.html
-   pacificbiosciences.com/
-   illumina.com/
-   completegenomics.com/
-   hapmap.ncbi.nlm.nih.gov/
-   ncbi.nlm.nih.gov/
-   ebi.ac.uk/embl/

All references (including patent applications, patents, and publications) cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.

What is claimed is:
1. A method for constructing an ancestral-specific reference genome, the method comprising: a) obtaining a familial genome data set comprising DNA sequences from members of a family; b) comparing the DNA sequences within the familial genome data set to obtain a corrected familial genome data set; c) preparing a first composite familial genome data set from the corrected familial genome data set; d) repeating steps a-c for a second, third or more families to obtain a second, third or more composite familial genome data sets; e) evaluating the first, second, third or more composite familial genome data sets for single nucleotide polymorphisms (SNPs) and/or haplotypes and assigning statistical significance to the SNPs and/or haplotypes; f) grouping the first, second, third or more composite familial genome data sets based on single nucleotide polymorphisms (SNPs) and/or haplotypes that are statistically significant; and g) preparing the ancestral-specific reference genome by compiling the SNPs and/or haplotypes shared by a group of composite familial genome data sets with the same ancestry.
2. The method of claim 1, wherein the DNA sequences are compared with the ancestral-specific reference genome with a nucleic acid comparison computer program.
3. The method of claim 1, further comprising the step of recording the ancestral-specific reference genome database onto a storage medium.
4. The method of claim 3, wherein the storage medium is a tangible storage medium or a cloud-based storage solution.
5. The method of claim 1, wherein the ancestral-specific reference genome is prepared by compiling the SNPs and/or haplotypes shared at a frequency of greater than 90%.
6. The method of claim 1, wherein the ancestral-specific reference genome is prepared by compiling the SNPs and/or haplotypes shared at a frequency of greater than 95%.
7. The method of claim 1, wherein the ancestral-specific reference genome is prepared by compiling the SNPs and/or haplotypes shared at a frequency of greater than 99%.
8. The method of claim 1, wherein the ancestral-specific reference genome has 1×10⁶ or more SNPs.
9. The method of claim 1, wherein the ancestral-specific reference genome has 3×10⁶ or more SNPs.
10. The method of claim 1, wherein the ancestral-specific reference genome has 6×10⁷ or more SNPs.
11. The method of claim 1, wherein the members of the family are animals and the animals are first degree, second degree, or third degree blood relatives.
12. The method of claim 1, wherein the familial genome data set comprises DNA sequences from three or more family members.
13. The method of claim 1, wherein the DNA sequences include whole genome DNA sequences and derivatives thereof.
14. The method of claim 1, wherein the DNA sequences include exome sequence data and derivatives thereof.
15. A method for constructing an ancestral-specific reference genome, the method comprising: a) obtaining a familial genome data set comprising genome DNA sequences from members of a family; b) comparing the genome DNA sequences within the familial genome data set to obtain a corrected familial genome data set; c) preparing a first composite familial genome data set from the corrected familial genome data set; d) repeating steps a-c for a second, third or more families to obtain a second, third or more composite familial genome data sets; e) evaluating the first, second, third or more composite familial genome data sets for single nucleotide polymorphisms (SNPs) and/or haplotypes and assigning statistical significance to the SNPs and/or haplotypes; f) grouping the first, second, third or more composite familial genome sequences based on single nucleotide polymorphisms (SNPs) and/or haplotypes that are statistically significant; and g) preparing the ancestral-specific reference genome by compiling the SNPs and/or haplotypes shared by a group of composite familial genome data sets with the same ancestry.
16. The method of claim 15, wherein the DNA sequences are compared with the ancestral-specific reference genome with a nucleic acid comparison computer program.
17. The method of claim 15, further comprising the step of recording the ancestral-specific reference genome database onto a storage medium.
18. The method of claim 17, wherein the storage medium is a tangible storage medium or a cloud-based storage solution.
19. The method of claim 15, wherein the ancestral-specific reference genome is prepared by compiling the SNPs and/or haplotypes shared at a frequency of greater than 90%.
20. The method of claim 15, wherein the ancestral-specific reference genome has 6×10⁷ or more SNPs.
21. The method of claim 15, wherein the familial whole genome data set comprises DNA sequences from three or more family members.
22. The method of claim 15, wherein the family members are human.